Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).
[Q] How to determine if a review is positive or negative?
[Ans] We could use Score/Rating. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. Reviews with a rating of 3 are considered neutral and are excluded from our analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
The dataset is available in two forms
To load the data, we use the SQLite version of the dataset, as it is easier to query and visualise the data efficiently.
Since we only want the overall sentiment of each review (positive or negative), we purposefully ignore all scores equal to 3. If the score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".
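The labelling rule described above can be sketched on a toy frame. This is only an illustration: the five rows and the `Label` column are made up, and only the `Score` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the 'Score' column name matches the dataset.
toy = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Drop neutral reviews (Score == 3), mirroring the SQL filter `Score != 3`.
toy = toy[toy["Score"] != 3].copy()

# Map the remaining scores to a binary label: 1 = positive, 0 = negative.
toy["Label"] = (toy["Score"] > 3).astype(int)

print(toy["Label"].tolist())
```

Scores 1 and 2 end up as 0 and scores 4 and 5 as 1, which is the same mapping the `partition` function applies further down.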
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from prettytable import PrettyTable
from tqdm import tqdm_notebook
import os
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite')
# filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power
# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con)
# for tsne assignment you can take 5k data points
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 100000""", con)
# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a negative rating(0).
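def partition(x):
    """Map a star score to a binary label: 0 = negative (score < 3), 1 = positive."""
    if x < 3:
        return 0
    return 1
# changing reviews with score less than 3 to be negative (0) and the rest to be positive (1)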
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)
print(display.shape)
display.head()
display[display['UserId']=='AZY10LLTJ71NX']
display['COUNT(*)'].sum()
It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()
As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on
It was inferred that reviews with the same parameters other than ProductId belonged to the same product, differing only in flavour or quantity. Hence, to reduce redundancy, it was decided to eliminate the rows having the same parameters.
The method used was to first sort the data by ProductId, then keep only the first of each set of similar reviews and delete the others. For example, in the above, only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product. Deduplicating without sorting could leave different representatives for the same product.
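A minimal sketch of the sort-then-deduplicate step, assuming a toy frame with the same column names as the dataset. The rows, user IDs, texts and the third product ID are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: one user posted the same text at the same time
# under two product variants, plus one unrelated review.
toy = pd.DataFrame({
    "ProductId":   ["B000HDOPZG", "B000HDL1RQ", "B000XXXXXX"],
    "UserId":      ["U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Bob"],
    "Time":        [1199577600, 1199577600, 1300000000],
    "Text":        ["Great wafers", "Great wafers", "Too sweet"],
})

# Sort by ProductId first so that "first" is well defined, then keep one
# row per (UserId, ProfileName, Time, Text) combination.
deduped = (
    toy.sort_values("ProductId")
       .drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
)

print(deduped["ProductId"].tolist())
```

Of the two identical reviews, the one under the alphabetically smaller ProductId (B000HDL1RQ) is kept, matching the example in the text.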
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100
Observation: In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible. Hence these two rows are also removed from the calculations.
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", con)
display.head()
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)
#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()
Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.
Hence in the Preprocessing phase we do the following in the order below:-
After which we collect the words used to describe positive and negative reviews
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)
sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)
sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)
sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1500 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)
print(sent_0)
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup
soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)
soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)
soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)
soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)
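The order of the substitutions matters: the irregular forms have to be handled before the general "n't" rule. The sketch below uses two hypothetical helpers, `expand` and `expand_general_only`, with only a subset of the rules, to show what happens when the specific cases are skipped.

```python
import re

def expand(phrase):
    """Expand a few English contractions; irregular forms must come first."""
    phrase = re.sub(r"won't", "will not", phrase)   # irregular: avoids "wo not"
    phrase = re.sub(r"can't", "can not", phrase)    # irregular: avoids "ca not"
    phrase = re.sub(r"n't", " not", phrase)         # general rule
    phrase = re.sub(r"'s", " is", phrase)           # note: also rewrites possessives
    return phrase

def expand_general_only(phrase):
    """Only the general rule, to show what goes wrong without the special cases."""
    return re.sub(r"n't", " not", phrase)

good = expand("I won't buy it, they can't ship, it's stale")
bad = expand_general_only("I won't buy it")
print(good)
print(bad)
```

One caveat of the "'s" rule is that possessives are rewritten too ("Amazon's" becomes "Amazon is"). For a bag-of-words sentiment model this is usually harmless, since "is" is removed as a stop word later.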
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)
#remove special characters: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# if we had <br/> instead of <br />, these tags would have been removed in the 1st step
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"])
# Combining all the above steps
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm_notebook(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)  # remove URLs
    sentance = BeautifulSoup(sentance, 'lxml').get_text()  # strip HTML tags
    sentance = decontracted(sentance)  # expand contractions
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()  # drop words containing digits
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)  # keep letters only
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())
preprocessed_reviews[1500]
## Similarly you can do preprocessing for review summary also.
import warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
preprocessed_summary = []
# tqdm is for printing the status bar
for sentance in tqdm_notebook(final['Summary'].values):
    sentance = re.sub(r"http\S+", "", sentance)  # remove URLs
    sentance = BeautifulSoup(sentance, 'lxml').get_text()  # strip HTML tags
    sentance = decontracted(sentance)  # expand contractions
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()  # drop words containing digits
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)  # keep letters only
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentance.strip())
#BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(preprocessed_reviews)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)
final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of our text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])
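To make the bag-of-words matrix concrete, here is a hand-rolled sketch on a hypothetical three-review corpus using only the standard library. It is not how `CountVectorizer` is implemented: the real class also lowercases, by default ignores single-character tokens, and returns a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus of already-preprocessed reviews.
docs = ["tasty tasty cookies", "cookies stale", "not tasty"]

# Vocabulary: sorted unique tokens, one column per token.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Each document becomes a vector of raw token counts over that vocabulary.
bow = [[Counter(doc.split())[tok] for tok in vocab] for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so the number of columns equals the number of unique words, which is what the last `print` in the cell above reports.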
#bi-gram, tri-gram and n-gram
#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# you can choose these numbers (min_df=10, max_features=5000) as you see fit
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)
final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of tfidf vectorizer ",type(final_tf_idf))
print("the shape of our text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentance=[]
for sentance in preprocessed_reviews:
list_of_sentance.append(sentance.split())
# Using Google News Word2Vectors
# in this project we are using a pretrained model by google
# it's a 3.3G file; once you load this into your memory
# it occupies ~9Gb, so please do this step only if you have >12G of ram
# we will provide a pickle file which contains a dict,
# and it contains all our corpus words as keys and model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.
# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment this whole cell
# or change these variables according to your need
is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True
if want_to_train_w2v:
    # min_count = 5 considers only words that occurred at least 5 times
    # note: gensim >= 4.0 renames `size` to `vector_size`
    w2v_model=Word2Vec(list_of_sentance,min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have google's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")
# note: gensim >= 4.0 removes `wv.vocab`; `wv.key_to_index` is its replacement
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model = TfidfVectorizer()
tf_idf_matrix = model.fit_transform(preprocessed_reviews)
# we are building a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))
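As a rough check of what ends up in `dictionary`, the IDF values for the three-sentence toy corpus in the comment above can be worked out by hand. The sketch assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses only the standard library. The names `S`, `df`, `idf` and `tf_idf` are local to this example.

```python
import math

# Toy corpus taken from the commented-out example above.
S = ["abc def pqr", "def def def abc", "pqr pqr def"]
n_docs = len(S)

# Document frequency: in how many documents does each word appear?
vocab = sorted({w for doc in S for w in doc.split()})
df = {w: sum(w in doc.split() for doc in S) for w in vocab}

# Smoothed IDF (assumed to match smooth_idf=True in scikit-learn).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

# Term frequency as count / review length, multiplied by the IDF.
review = S[1].split()
tf_idf = {w: idf[w] * review.count(w) / len(review) for w in set(review)}

print(idf)
print(tf_idf)
```

"def" occurs in every document, so its IDF is exactly 1, while the rarer "abc" and "pqr" get a higher value. Note that this count/length weighting is not the same number as the entries of `tf_idf_matrix`, which the vectorizer normalises per row by default.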
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf
tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50
    weight_sum =0; # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we use
            # dictionary[word] = idf value of word in the whole corpus
            # sent.count(word)/len(sent) = tf value of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1
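The weighting can be checked on a tiny example. The sketch below uses hypothetical 2-dimensional word vectors and IDF values, and a helper name (`tfidf_weighted_vector`) that is not part of the notebook. Unlike the loop above, which visits every occurrence of a word, it visits each distinct word once.

```python
import numpy as np

# Hypothetical 2-d word vectors and IDF values, for illustration only.
word_vecs = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}
idf = {"good": 1.0, "tea": 2.0}

def tfidf_weighted_vector(tokens, word_vecs, idf, dim):
    """Return the TF-IDF weighted mean of the vectors of the known tokens."""
    vec = np.zeros(dim)
    weight_sum = 0.0
    for word in set(tokens):  # each distinct word once
        if word in word_vecs and word in idf:
            tf = tokens.count(word) / len(tokens)
            weight = tf * idf[word]
            vec += weight * word_vecs[word]
            weight_sum += weight
    # Reviews with no known word keep the all-zero vector.
    return vec / weight_sum if weight_sum else vec

review = ["good", "good", "tea", "unknown"]
result = tfidf_weighted_vector(review, word_vecs, idf, dim=2)
print(result)
```

Here "good" has tf 0.5 and idf 1, "tea" has tf 0.25 and idf 2, so both carry weight 0.5 and the review vector lands halfway between the two word vectors. The out-of-vocabulary token is skipped.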
with X-axis as min_sample_split, Y-axis as max_depth, and Z-axis as AUC Score (we have given a notebook which explains how to plot this 3D plot; you can find it in the same drive, 3d_scatter_plot.ipynb), or
seaborn heat maps with rows as min_sample_split, columns as max_depth, and values inside the cell representing AUC Score
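One way to arrange the scores for such a heat map is a 2-D array indexed by (min_samples_split, max_depth). The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` instead of the review features, a much smaller grid than the notebook, and leaves the seaborn call as a comment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses the review features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

min_splits = [2, 10, 100]   # rows of the heat map
depths = [1, 5, 10]         # columns of the heat map

# auc_grid[i, j] = CV AUC for min_samples_split=min_splits[i], max_depth=depths[j]
auc_grid = np.zeros((len(min_splits), len(depths)))
for i, s in enumerate(min_splits):
    for j, d in enumerate(depths):
        clf = DecisionTreeClassifier(max_depth=d, min_samples_split=s, random_state=0)
        clf.fit(X_tr, y_tr)
        auc_grid[i, j] = roc_auc_score(y_cv, clf.predict_proba(X_cv)[:, 1])

print(auc_grid.round(3))
# To draw it (assuming seaborn is imported as sns):
# sns.heatmap(auc_grid, annot=True, xticklabels=depths, yticklabels=min_splits)
```

Filling the full grid once also makes it possible to read off the best (depth, split) pair directly with `np.unravel_index(auc_grid.argmax(), auc_grid.shape)`.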

# Please write all the code with proper documentation
import numpy as np
import pandas as pd
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn import tree
bow_vect=CountVectorizer()
x=preprocessed_reviews
y=np.array(final['Score'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
x_train,x_cv,y_train,y_cv=train_test_split(x_train,y_train,test_size=0.3)
fbowx_tr=bow_vect.fit_transform(x_train)
fbowx_cv=bow_vect.transform(x_cv)
fbowx_te=bow_vect.transform(x_test)
std=StandardScaler(with_mean=False) #Standardizing Data
fbowx_tr=std.fit_transform(fbowx_tr)
fbowx_cv=std.transform(fbowx_cv)
fbowx_te=std.transform(fbowx_te)
dt=tree.DecisionTreeClassifier().fit(fbowx_tr,y_train)
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0  # best min_samples_split and best CV AUC seen so far for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best min_samples_split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0  # best max_depth and best CV AUC seen so far for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this min_samples_split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(fbowx_tr,y_train)
pred_te=dt.predict_proba(fbowx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(fbowx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
def find_best_threshold(threshould, fpr, tpr):
    t = threshould[np.argmax(tpr*(1-fpr))]
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    return t
def predict_with_best_t(proba, threshould):
    predictions = []
    for i in proba:
        if i>=threshould:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
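The threshold rule can be traced on a handful of hypothetical ROC points. The arrays below are made up for illustration and only mimic the ordering that `roc_curve` returns (thresholds decreasing, fpr and tpr non-decreasing).

```python
import numpy as np

# Hypothetical ROC points, for illustration only.
thresholds = np.array([0.9, 0.7, 0.4, 0.1])
fpr = np.array([0.0, 0.1, 0.5, 1.0])
tpr = np.array([0.3, 0.8, 0.9, 1.0])

# tpr * (1 - fpr) rewards a high true-positive rate and a low false-positive rate.
score = tpr * (1 - fpr)
best_t = thresholds[np.argmax(score)]

# Label a probability as positive when it reaches the chosen threshold.
proba = np.array([0.95, 0.72, 0.5, 0.05])
predictions = (proba >= best_t).astype(int)

print(best_t, predictions.tolist())
```

The scores are 0.30, 0.72, 0.45 and 0.00, so the second threshold (0.7) is chosen. The vectorised comparison gives the same labels as the explicit loop in `predict_with_best_t`.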
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for Test data
from sklearn.metrics import confusion_matrix
# the threshold is still chosen on the train ROC curve and then applied to the test predictions
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
# Please write all the code with proper documentation
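all_features = bow_vect.get_feature_names()
# refit with the best hyperparameters found above (not the last loop values of d and ms)
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(fbowx_tr,y_train)
# feature indices sorted by decreasing importance
features=np.argsort(dt.feature_importances_)[::-1]
for i in features[0:20]:
    print(all_features[i])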
from sklearn import tree
from graphviz import Source
import graphviz
feat = bow_vect.get_feature_names()
Source(tree.export_graphviz(dt, out_file = None, feature_names = feat,max_depth=3))
#https://stackoverflow.com/questions/27817994/visualizing-decision-tree-in-scikit-learn
# Please write all the code with proper documentation
tf_vect=TfidfVectorizer(ngram_range=(1,2),min_df=10)
#tf_vect.fit(preprocessed_reviews)
ftfx_tr=tf_vect.fit_transform(x_train)
ftfx_cv=tf_vect.transform(x_cv)
ftfx_te=tf_vect.transform(x_test)
std = StandardScaler(with_mean=False)
ftfx_tr=std.fit_transform(ftfx_tr)#Standardizing Data
ftfx_cv=std.transform(ftfx_cv)
ftfx_te=std.transform(ftfx_te)
dt=tree.DecisionTreeClassifier().fit(ftfx_tr,y_train)
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0  # best min_samples_split and best CV AUC seen so far for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best min_samples_split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0  # best max_depth and best CV AUC seen so far for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this min_samples_split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for Test data
from sklearn.metrics import confusion_matrix
# the threshold is still chosen on the train ROC curve and then applied to the test predictions
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
# Please write all the code with proper documentation
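all_features = tf_vect.get_feature_names()
# refit with the best hyperparameters found above (not the last loop values of d and ms)
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
# feature indices sorted by decreasing importance
features=np.argsort(dt.feature_importances_)[::-1]
for i in features[0:20]:
    print(all_features[i])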
# Please write all the code with proper documentation
from sklearn import tree
from graphviz import Source
import graphviz
feat = tf_vect.get_feature_names()
Source(tree.export_graphviz(dt, out_file = None, feature_names = feat,max_depth=3))
# Please write all the code with proper documentation
#Avg word2vec: the model is trained on the train split only and reused for cv and test
def avg_w2v(sentences, w2v_model, vocab, dim=50):
    """Return one averaged word vector per tokenised review."""
    vectors = []  # the avg-w2v for each sentence/review is stored in this list
    for sent in tqdm_notebook(sentences):  # for each review/sentence
        sent_vec = np.zeros(dim)  # word vectors have length 50; use dim=300 with google's w2v
        cnt_words = 0  # num of words with a valid vector in the sentence/review
        for word in sent:  # for each word in a review/sentence
            if word in vocab:
                sent_vec += w2v_model.wv[word]
                cnt_words += 1
        if cnt_words != 0:
            sent_vec /= cnt_words
        vectors.append(sent_vec)
    return vectors

sent_train_list = [sentence.split() for sentence in x_train]
sent_cv_list = [sentence.split() for sentence in x_cv]
sent_test_list = [sentence.split() for sentence in x_test]
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
w2v_vocab = set(w2v_words)  # set membership is much faster than scanning a list
#Avg word2vec for train data
sent_train_vectors = avg_w2v(sent_train_list, w2v_model, w2v_vocab)
print(len(sent_train_vectors))
print(len(sent_train_vectors[0]))
#Avg word2vec for cv data
sent_cv_vectors = avg_w2v(sent_cv_list, w2v_model, w2v_vocab)
print(len(sent_cv_vectors))
print(len(sent_cv_vectors[0]))
#Avg word2vec for test data
sent_test_vectors = avg_w2v(sent_test_list, w2v_model, w2v_vocab)
print(len(sent_test_vectors))
print(len(sent_test_vectors[0]))
#This code is copied and modified from :https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW#scrollTo=3-XGItt4PSx0
aw2vx_tr=sent_train_vectors
aw2vx_cv=sent_cv_vectors
aw2vx_te=sent_test_vectors
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0  # best min_samples_split and best CV AUC seen so far for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best min_samples_split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0  # best max_depth and best CV AUC seen so far for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this min_samples_split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
# Refit on the average Word2Vec features with the tuned hyperparameters
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(aw2vx_tr,y_train)
pred_te=dt.predict_proba(aw2vx_te)[:,1]
fpr_te, tpr_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(aw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, tpr_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
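The helpers `find_best_threshold` and `predict_with_best_t` are called above but are not defined in this part of the notebook. A minimal sketch of one common definition follows, assuming the threshold is chosen to maximise TPR × (1 − FPR). The notebook's actual helpers may differ, and the ROC points below are made up for illustration.

```python
# Assumed definitions: the real helpers live elsewhere in the notebook.
def find_best_threshold(thresholds, fpr, tpr):
    """Pick the threshold that maximises tpr * (1 - fpr)."""
    scores = [t * (1 - f) for f, t in zip(fpr, tpr)]
    best_index = scores.index(max(scores))
    return thresholds[best_index]

def predict_with_best_t(proba, threshold):
    """Label a point positive (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in proba]

# Toy ROC points, made up for illustration.
thresholds = [0.9, 0.6, 0.4, 0.1]
fpr = [0.0, 0.1, 0.5, 1.0]
tpr = [0.3, 0.8, 0.9, 1.0]

best_t = find_best_threshold(thresholds, fpr, tpr)
labels = predict_with_best_t([0.95, 0.65, 0.5, 0.05], best_t)
print(best_t, labels)
```

Choosing the threshold on the training ROC and reusing it for the test data keeps the test labels out of the decision rule.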
# Please write all the code with proper documentation
sent_train_list=[sentence.split() for sentence in x_train]
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),min_df=10, max_features=500)
tf_idf_matrix=tf_idf_vect.fit_transform(x_train)
tfidf_feat = tf_idf_vect.get_feature_names()
# map each TF-IDF feature to its idf value
dictionary = dict(zip(tf_idf_vect.get_feature_names(), list(tf_idf_vect.idf_)))
#Train data
tfidf_sent_train_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_train_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_train_vectors.append(sent_vec)
    row += 1
#CV data
sent_cv_list=[sentence.split() for sentence in x_cv]
tfidf_sent_cv_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_cv_vectors.append(sent_vec)
    row += 1
#Test data
sent_test_list=[sentence.split() for sentence in x_test]
tfidf_sent_test_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_test_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_test_vectors.append(sent_vec)
    row += 1
tfw2vx_tr=tfidf_sent_train_vectors
tfw2vx_cv=tfidf_sent_cv_vectors
tfw2vx_te=tfidf_sent_test_vectors
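The train, CV and test loops above repeat the same computation: each review vector is the average of its word vectors, weighted by idf × term frequency. A minimal sketch of that computation as a single helper is shown below. `tfidf_weighted_vector` is a hypothetical name, and plain dicts and lists stand in for `w2v_model.wv`, `dictionary` and NumPy arrays so the snippet has no dependencies; the toy vectors and idf values are made up.

```python
def tfidf_weighted_vector(tokens, word_vectors, idf, dim):
    """Weighted average of word vectors; weight = idf * term frequency."""
    sent_vec = [0.0] * dim
    weight_sum = 0.0
    for word in tokens:
        # Skip words without a vector or without an idf value.
        if word in word_vectors and word in idf:
            tf_idf = idf[word] * (tokens.count(word) / len(tokens))
            for i, value in enumerate(word_vectors[word]):
                sent_vec[i] += value * tf_idf
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec = [v / weight_sum for v in sent_vec]
    return sent_vec

# Toy two-dimensional vectors and idf values, made up for illustration.
word_vectors = {"good": [1.0, 0.0], "tea": [0.0, 1.0]}
idf = {"good": 1.0, "tea": 3.0}

vec = tfidf_weighted_vector(["good", "tea", "the"], word_vectors, idf, dim=2)
print(vec)
```

Because "tea" has three times the idf of "good", it contributes three times as much to the review vector; "the" has no vector and is ignored.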
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0 # best min_samples_split and its CV AUC for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0 # best depth and its CV AUC for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(tfw2vx_tr,y_train)
pred_te=dt.predict_proba(tfw2vx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(tfw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
# Feature engineering: append the review summary and the raw review length to each review
for i in range(len(preprocessed_reviews)):
    preprocessed_reviews[i]=preprocessed_reviews[i]+ ' '+preprocessed_summary[i]+' '+str(len(final.Text.iloc[i]))
preprocessed_fe_reviews=preprocessed_reviews
preprocessed_fe_reviews[1500]
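The loop above edits `preprocessed_reviews` in place, and `preprocessed_fe_reviews` is only a second name for the same list, so running the cell twice would append the summary and length twice. A minimal sketch of a variant that builds a new list instead is shown below. `add_summary_and_length` is a hypothetical helper, and the toy strings are made up for illustration.

```python
def add_summary_and_length(reviews, summaries, raw_texts):
    """Return new strings: review + summary + length of the raw review text."""
    return [
        f"{review} {summary} {len(raw)}"
        for review, summary, raw in zip(reviews, summaries, raw_texts)
    ]

# Toy inputs, made up for illustration.
reviews = ["great tea"]
summaries = ["love it"]
raw_texts = ["Great tea!!"]

fe_reviews = add_summary_and_length(reviews, summaries, raw_texts)
print(fe_reviews)
```

The input list is left unchanged, so the cell can be re-run safely.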
bow_vect=CountVectorizer()
x=preprocessed_fe_reviews
y=np.array(final['Score'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
x_train,x_cv,y_train,y_cv=train_test_split(x_train,y_train,test_size=0.3)
fbowx_tr=bow_vect.fit_transform(x_train)
fbowx_cv=bow_vect.transform(x_cv)
fbowx_te=bow_vect.transform(x_test)
std=StandardScaler(with_mean=False) #Standardizing Data
fbowx_tr=std.fit_transform(fbowx_tr)
fbowx_cv=std.transform(fbowx_cv)
fbowx_te=std.transform(fbowx_te)
dt=tree.DecisionTreeClassifier().fit(fbowx_tr,y_train)
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0 # best min_samples_split and its CV AUC for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0 # best depth and its CV AUC for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(fbowx_tr,y_train)
pred_te=dt.predict_proba(fbowx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(fbowx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# best_m[:len(depths)] holds the best min_samples_split found for each depth
trace1 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]
layout = go.Layout(scene = dict(
    xaxis = dict(title='max_depth'),
    yaxis = dict(title='min_samples_split'),
    zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
# Please write all the code with proper documentation
tf_vect=TfidfVectorizer(ngram_range=(1,2),min_df=10)
#tf_vect.fit(preprocessed_reviews)
ftfx_tr=tf_vect.fit_transform(x_train)
ftfx_cv=tf_vect.transform(x_cv)
ftfx_te=tf_vect.transform(x_test)
std = StandardScaler(with_mean=False)
ftfx_tr=std.fit_transform(ftfx_tr)#Standardizing Data
ftfx_cv=std.transform(ftfx_cv)
ftfx_te=std.transform(ftfx_te)
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0 # best min_samples_split and its CV AUC for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0 # best depth and its CV AUC for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# best_m[:len(depths)] holds the best min_samples_split found for each depth
trace1 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]
layout = go.Layout(scene = dict(
    xaxis = dict(title='max_depth'),
    yaxis = dict(title='min_samples_split'),
    zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
# Please write all the code with proper documentation
#Avg word2vec for train data (the Word2Vec model is trained on the training reviews only)
sent_train_list=[sentence.split() for sentence in x_train]
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
sent_train_vectors = [] # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_train_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector of length 50 (the Word2Vec size); use 300 with Google's pretrained w2v
    cnt_words =0 # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_train_vectors.append(sent_vec)
print(len(sent_train_vectors))
print(len(sent_train_vectors[0]))
#Avg word2vec for cv data
sent_cv_list=[sentence.split() for sentence in x_cv]
sent_cv_vectors = [] # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector of length 50 (the Word2Vec size); use 300 with Google's pretrained w2v
    cnt_words =0 # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_cv_vectors.append(sent_vec)
print(len(sent_cv_vectors))
print(len(sent_cv_vectors[0]))
#Avg word2vec for test data
sent_test_list=[sentence.split() for sentence in x_test]
sent_test_vectors = [] # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_test_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector of length 50 (the Word2Vec size); use 300 with Google's pretrained w2v
    cnt_words =0 # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_test_vectors.append(sent_vec)
print(len(sent_test_vectors))
print(len(sent_test_vectors[0]))
#This code is copied and modified from :https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW#scrollTo=3-XGItt4PSx0
aw2vx_tr=sent_train_vectors
aw2vx_cv=sent_cv_vectors
aw2vx_te=sent_test_vectors
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0 # best min_samples_split and its CV AUC for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0 # best depth and its CV AUC for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
# Refit on the average Word2Vec features with the tuned hyperparameters
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(aw2vx_tr,y_train)
pred_te=dt.predict_proba(aw2vx_te)[:,1]
fpr_te, tpr_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(aw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, tpr_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# best_m[:len(depths)] holds the best min_samples_split found for each depth
trace1 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]
layout = go.Layout(scene = dict(
    xaxis = dict(title='max_depth'),
    yaxis = dict(title='min_samples_split'),
    zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
# Please write all the code with proper documentation
sent_train_list=[sentence.split() for sentence in x_train]
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),min_df=10, max_features=500)
tf_idf_matrix=tf_idf_vect.fit_transform(x_train)
tfidf_feat = tf_idf_vect.get_feature_names()
# map each TF-IDF feature to its idf value
dictionary = dict(zip(tf_idf_vect.get_feature_names(), list(tf_idf_vect.idf_)))
#Train data
tfidf_sent_train_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_train_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_train_vectors.append(sent_vec)
    row += 1
#CV data
sent_cv_list=[sentence.split() for sentence in x_cv]
tfidf_sent_cv_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_cv_vectors.append(sent_vec)
    row += 1
#Test data
sent_test_list=[sentence.split() for sentence in x_test]
tfidf_sent_test_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row=0
for sent in tqdm_notebook(sent_test_list): # for each review/sentence
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum =0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of the word in the whole corpus
            # sent.count(word)/len(sent) = tf value of the word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_test_vectors.append(sent_vec)
    row += 1
tfw2vx_tr=tfidf_sent_train_vectors
tfw2vx_cv=tfidf_sent_cv_vectors
tfw2vx_te=tfidf_sent_test_vectors
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0 # best min_samples_split and its CV AUC for this depth
    for s in min_splits:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    # refit with the best split for this depth and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]
plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')
plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0 # best depth and its CV AUC for this min_samples_split
    for d in depths:
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    # refit with the best depth for this split and record train/CV AUC
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')
plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(tfw2vx_tr,y_train)
pred_te=dt.predict_proba(tfw2vx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(tfw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)
plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (train)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
#Confusion matrix for test data, using the threshold chosen on the train ROC
print("Test confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion matrix (test)')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# best_m[:len(depths)] holds the best min_samples_split found for each depth
trace1 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=best_m[:len(depths)],z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]
layout = go.Layout(scene = dict(
    xaxis = dict(title='max_depth'),
    yaxis = dict(title='min_samples_split'),
    zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
# Please compare all your models using Prettytable library
x=PrettyTable()
x.field_names=(['Vectorizer','Best_Depth','Best_Split','AUC','Feature Engineering'])
x.add_row(['BOW',50,500,0.822,'NO'])
x.add_row(['TF-IDF',50,500,0.812,'NO'])
x.add_row(['AW2V',500,500,0.778,'NO'])
x.add_row(['TF-IDF_w2v',10,500,0.749,'NO'])
x.add_row(['BOW',50,500,0.857,'Yes'])
x.add_row(['TF-IDF',50,500,0.856,'Yes'])
x.add_row(['AW2V',10,500,0.807,'Yes'])
x.add_row(['TF-IDF_w2v',10,500,0.783,'Yes'])
print(x)
After feature engineering, the AUC improves slightly for every vectorizer, for example from 0.822 to 0.857 for BOW.
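As a quick check of that comparison, the gain per vectorizer can be computed directly from the AUC values in the table above. This is a small sketch; the dictionary names are introduced here for illustration only.

```python
# AUC values copied from the comparison table.
auc_without_fe = {"BOW": 0.822, "TF-IDF": 0.812, "AW2V": 0.778, "TF-IDF_w2v": 0.749}
auc_with_fe = {"BOW": 0.857, "TF-IDF": 0.856, "AW2V": 0.807, "TF-IDF_w2v": 0.783}

# Gain from feature engineering, rounded to three decimals.
gains = {
    name: round(auc_with_fe[name] - auc_without_fe[name], 3)
    for name in auc_without_fe
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
```

Every gain is positive and falls between roughly 0.03 and 0.045 AUC, with TF-IDF benefiting the most.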